Neural network and kinetic modelling of human genome replication reveal replication origin locations and strengths

In humans and other metazoans, the determinants of replication origin location and strength are still elusive. Origins are licensed in G1 phase and fired in S phase of the cell cycle. It is debated which of these two temporally separate steps determines origin efficiency. Experiments can independently profile mean replication timing (MRT) and replication fork directionality (RFD) genome-wide. Such profiles contain information on multiple origins’ properties and on fork speed. Due to possible origin inactivation by passive replication, however, observed and intrinsic origin efficiencies can markedly differ. Thus, there is a need for methods to infer intrinsic from observed origin efficiency, which is context-dependent. Here, we show that MRT and RFD data are highly consistent with each other but contain information at different spatial scales. Using neural networks, we infer an origin licensing landscape that, when inserted in an appropriate simulation framework, jointly predicts MRT and RFD data with unprecedented precision and underlines the importance of dispersive origin firing. We furthermore uncover an analytical formula that predicts intrinsic from observed origin efficiency combined with MRT data. Comparison of inferred intrinsic origin efficiencies with experimental profiles of licensed origins (ORC, MCM) and actual initiation events (Bubble-seq, SNS-seq, OK-seq, ORM) shows that intrinsic origin efficiency is not solely determined by licensing efficiency. Thus, human replication origin efficiency is set at both the origin licensing and firing steps.

Reviewer #2: For the previous version of this paper, the reviewers appreciated the usefulness of the approach and the quality of the results but also collectively raised a number of concerns, ranging from technical comments on methods to the reliance on relatively low-quality MCM mapping data in drawing conclusions about the replication program. In the revised version, the authors have done a very good job of responding substantively to the points raised and have significantly improved the paper (likely increasing its impact). The methods of inferring IPLS given here are interesting and fruitful, even if all aspects (such as the form of Eq. (9)) are not fully understood. I suspect that the conclusions as to the variability of MCM firing propensity will continue to be debated, but this paper will be the starting point for any future discussion. I thus recommend publication in PLOS Computational Biology after the following minor points are addressed:
Line 173: "Assuming a linear relation between replication fraction and S-phase duration…" Why? All of the types of models considered have a sigmoidal relation between the S-phase time and replication fraction. Indeed, the discussion in Lines 436-450 concerns just this point…
To check the validity of Eq. (1) and Eq. (2), we need an estimate of the mean replication time in time units. Experimentally, however, it is only available as a global replicated fraction. This is why we need this assumption. To address the reviewer's comment, we modified the sentence to: Line 173: To experimentally check Equation (1) we need to convert Repli-seq data, expressed in replicated genome fraction, into time units by multiplication by an estimate of S-phase duration TS. This implicitly assumes a linear relation between replicated fraction and time in S phase, although the true relation appears to be sigmoid rather than linear, which may slightly distort very early or very late S-phase data.
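The effect of the linearity assumption discussed for Line 173 can be illustrated with a small Python sketch. Everything numerical in it is an assumption made for illustration and is not taken from the manuscript: the S-phase duration of 8 h, the logistic shape of the sigmoid and its steepness, and all function names are hypothetical. The sketch compares the constant conversion factor T_S implied by the linear relation with the local factor dt/df implied by a sigmoid relation between time and replicated fraction.

```python
import math

T_S = 8.0  # hypothetical S-phase duration in hours (assumption, not from the manuscript)
K = 6.0    # hypothetical steepness of the sigmoid (assumption)


def linear_time(fraction, t_s=T_S):
    """Convert a replicated fraction in [0, 1] to time, assuming a linear relation."""
    return fraction * t_s


def _logistic(u):
    return 1.0 / (1.0 + math.exp(-u))


def sigmoid_fraction(t, t_s=T_S, k=K):
    """Hypothetical sigmoid replicated fraction, rescaled so f(0) = 0 and f(t_s) = 1."""
    lo = _logistic(-k / 2.0)
    hi = _logistic(k / 2.0)
    return (_logistic(k * (t / t_s - 0.5)) - lo) / (hi - lo)


def local_scale_ratio(t, t_s=T_S, k=K, h=1e-4):
    """Local dt/df under the sigmoid, divided by the constant factor t_s of the linear conversion."""
    dfdt = (sigmoid_fraction(t + h, t_s, k) - sigmoid_fraction(t - h, t_s, k)) / (2.0 * h)
    return 1.0 / (dfdt * t_s)


# A ratio of 1 would mean that the linear conversion is locally exact.
for t in (0.2, 2.0, 4.0, 6.0, 7.8):
    print(f"t = {t:4.1f} h  fraction = {sigmoid_fraction(t):.3f}  ratio = {local_scale_ratio(t):.2f}")
```

With these assumed parameters the ratio is above 1 near the boundaries of S phase and below 1 in mid-S phase, so any quantity that depends on the slope of MRT, such as the right-hand side of Eq. (1), would be rescaled differently in early, mid and late S phase.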
Line 178: "… circumvented by data smoothing…" This is precisely what Eq. (2) does (at a length scale l).
And this is what we had in mind. We clarified it as follows: Line 181: This can be circumvented by data smoothing, at the expense of resolution, by integrating Eq. (1) at point x over length l, leading to:
Line 251: "deriving RFD profile from MRT data Eq. (1) by numerical derivative would produce low resolution RFD profiles with amplified noise…" Again, not true if the derivatives were estimated in a more sophisticated way.
Here, the low resolution of the RFD profiles obtained is mainly due to the low resolution of the MRT data. We clarified this as follows: Line 253: Deriving RFD profiles from limited-resolution (~100 kb) MRT data would produce low-resolution RFD profiles.
For Lines 178, 251, there is nothing wrong with the methods used, and I agree that MRT is better at long and RFD at short scales, but it is also true that there is no real advantage to using Eq. (2) rather than Eq. (1) with derivatives smoothed over the same length scale.
We agree with this comment, but think that using Eq. (2) makes the smoothing process clearer for readers unfamiliar with the different methods of estimating derivatives.
Line 505: Taylor expansion (not extension).
We have corrected this.
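The point discussed for Lines 178 and 251, that integrating Eq. (1) over a length l amounts to smoothing the derivative over the same length, can be checked numerically with a minimal sketch. It assumes that Eq. (1) has the form RFD(x) = v dMRT(x)/dx, which is not reproduced in this letter, and it uses a toy MRT profile with a single origin. The fork speed, bin size, noise level and function names are all hypothetical.

```python
import random

V = 1.5       # hypothetical fork speed in kb/min (assumption)
DX = 1.0      # bin size in kb (assumption)
N_BINS = 400
WINDOW = 50   # smoothing length l, expressed in bins

random.seed(0)
ORIGIN = N_BINS // 2
# Toy MRT profile in minutes: one origin at the centre plus Gaussian measurement noise.
mrt = [abs(i - ORIGIN) * DX / V + random.gauss(0.0, 2.0) for i in range(N_BINS)]


def pointwise_rfd(mrt, i):
    """Assumed form of Eq. (1), RFD = v * dMRT/dx, with a one-bin finite difference."""
    return V * (mrt[i + 1] - mrt[i]) / DX


def integrated_rfd(mrt, i, n):
    """Eq. (2)-style estimate: mean RFD over n bins from the MRT difference across them."""
    return V * (mrt[i + n] - mrt[i]) / (n * DX)


def smoothed_derivative_rfd(mrt, i, n):
    """Pointwise estimate applied bin by bin, then box-averaged over the same n bins."""
    return sum(pointwise_rfd(mrt, j) for j in range(i, i + n)) / n


# The toy profile has RFD = -1 left of the origin and +1 right of it.
for i in (100, 300):
    print(i, pointwise_rfd(mrt, i), integrated_rfd(mrt, i, WINDOW),
          smoothed_derivative_rfd(mrt, i, WINDOW))
```

The box-averaged one-bin derivative telescopes to the MRT difference across the window, so the two smoothed estimates agree up to floating-point rounding. This equivalence holds only for a box average; other smoothing kernels would give different estimates.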
Lines 517-520 and Eq. (9): The authors' arguments are superficially reasonable, and I am willing to let the point stand. However, the good agreement obtained with the ad hoc exponential in Eq. (9), seen in Fig. 8C and supposedly arising from combining the time-variation of F_free(t) with the expected 1/t dependence, makes one wonder…
Unfortunately, we were not able to determine the theoretical origin of this agreement.
• It might be clearer to write I_M(x) rather than I_M to emphasize that the profile is something that varies along the genome, as opposed to other quantities such as fork velocity that are assumed constant over the genome. (The x-dependence is occasionally given, as in Line 481, but not usually.)
We changed I_M to I_M(x) or "I_M profile" in most occurrences of I_M to highlight the spatial dependency.
• The phrase "reciprocally consistent" should probably be "mutually consistent", as the former suggests MRT and RFD might be inversely related, which is not what is intended.
We changed it to "mutually consistent" (Line 670).
Reviewer #3: In the revised manuscript the authors have addressed most of the concerns that were raised in the initial review. The overall performance of their models is impressive, and their results provide a solid foundation on which other researchers can build. However, I suggest that the authors: 1) discuss the inconsistencies of the measurements of ORC and MCM components and the difficulties that these inconsistencies create when attempting to draw conclusions about the relative contributions of origin licensing vs firing, and 2) clarify the role of "confounding parameters" such as transcription and MRT when discounting the origin density model.
1. One of the goals of the study is to help elucidate the relative contribution of origin density vs. origin affinity in shaping genome replication in human cells. While the authors' replication modeling performs well at both large (MRT) and smaller scales (RFD), the final determination of the origin affinity vs origin density critically depends on having accurate measurements of ORC and MCM density across the genome. In Table 1 Given inconsistencies in MCM measurements, it is difficult to draw firm conclusions about licensing vs firing models based on these results, and the authors should discuss these limitations.
In the discussion we added: Line 660: However, the high variability of MCM profiles prevents us from totally excluding that, once the source of this variability is understood, an improved MCM profile may better predict RFD and MRT.
Lines 123 to 125: The authors write: "However, our comparison of ORC, MCM and RFD profiles of the Raji cell line showed that when confounding parameters such as MRT and transcription status are controlled, ORC and MCM densities are not predictive of IZs." Given the absence of correlation between MCM3 and RFD in Raji cells (R = 0.00), it is not surprising that MCM3 does not delineate IZs in their previous study (Kirstein et al 2021), whether or not confounding parameters are taken into account. On the other hand, replication correlated with MCM density based on MCM2 density measurements (Foss 2021) for both MRT (0.52) and RFD (0.41), which could perhaps explain the higher MCM density in IZs in that study.
Finally, why would one need to take transcription status into account in deciding between the origin density and origin affinity models? If MCM densities were the sole determinant of replication initiation, and the MCM densities are a reflection of transcription (i.e. MCM are not found within transcribed genes), removing transcription as a "confounding parameter" would immediately discount the origin density model.
We agree with these remarks. To avoid any confusion, we have rephrased the sentence as: Line 123: However, our comparison of ORC, MCM and RFD profiles of the Raji cell line showed that at constant MRT and transcription level, ORC and MCM densities are similar in initiation, random replication and termination zones.
Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.